gh-148706: Pure-Python pickle._Pickler can be significantly optimized #148707
Open
mjbommar wants to merge 9 commits into python:main from
Conversation
…/ _batch_setitems
The C accelerator has batch_list_exact (Modules/_pickle.c:3179) and
batch_dict_exact (Modules/_pickle.c:3455) that skip the generic
batched() + enumerate() overhead for exact list / dict instances under
binary protocols. The pure-Python _batch_appends and _batch_setitems
didn't have equivalents.
This adds:
_batch_appends_exact(obj) — slice-snapshot per batch; no itertools.batched generator, no enumerate() object.
_batch_setitems_exact(obj) — for dicts <= BATCHSIZE, a single direct iteration of obj.items(); for larger dicts, a materialized list with sliced batches.
Activated only when type(obj) is list / dict and self.bin is true.
Other types and proto 0 still use the generic paths unchanged.
Pure-Python dump numbers (best-of-9 median per bench):
list_of_ints_10k dump -3.1% load -2.6%
list_of_strs_1k dump -1.7% load -3.2%
dict_str_int_5k dump -0.2% load -3.5% (>BATCHSIZE)
deep_list dump -17.2% load -0.3%
nested_list_of_dicts dump -22.1% load -1.1%
Load deltas are unrelated micro-noise; the load path was not touched.
Correctness:
test_pickle 1060/1060 pass
test_pickletools 202/202 pass
test_copy 83/83 pass
test_copyreg 6/6 pass
test_importlib 1217/1217 pass
dill 0.4.1 29/30 pass (pre-existing 3.15a8 incompat)
cloudpickle 3.1.2 236/236 + 22 skipped + 2 xfailed (identical)
The test_evil_pickler_mutating_collection case motivated using a
per-batch slice snapshot for lists (so concurrent mutation doesn't
raise IndexError) and relying on dict's built-in size-change check
for dicts.
…ave()
Two follow-ups to the exact-container fast paths (94b53eb):
D) Inline framer.commit_frame() at the top of save(). The hot check
(current_frame is None, or tell() < _FRAME_SIZE_TARGET) runs on every
save() call; skipping the Python method dispatch when no commit is
needed removes a measurable per-call tax on long runs.
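A toy model of the inlined hot check, assuming the attribute names from pickle's _Framer (current_frame, _FRAME_SIZE_TARGET); this is illustrative, not the shipped diff:

```python
import io

_FRAME_SIZE_TARGET = 64 * 1024  # matches pickle._Framer._FRAME_SIZE_TARGET

def needs_commit(current_frame):
    """The check that runs on every save() call: commit only when a
    frame is open and has reached the target size."""
    return (current_frame is not None
            and current_frame.tell() >= _FRAME_SIZE_TARGET)

needs_commit(None)                       # no frame open: skip
small = io.BytesIO(); small.write(b"x" * 10)
needs_commit(small)                      # under target: skip
big = io.BytesIO(); big.write(b"x" * (64 * 1024))
needs_commit(big)                        # frame full: commit
```

Inlining this condition means the common skip case costs two attribute loads and a comparison instead of a full Python method call into the framer.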
E) Short-circuit the dispatch-table dict.get for the common atomic
types (str, int, NoneType, bool, float) by matching type(obj) with
direct `is` checks before falling through to `self.dispatch.get(t)`.
Placed after the memo and reducer_override checks so semantics for
repeated strings, subclasses, and custom reducers are unchanged.
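The semantics argument can be checked with a small model: `type(obj) is T` identity tests never match subclass instances, so subclasses keep falling through to the generic dispatch-table lookup exactly as before (function and return values here are a toy, not the real save()):

```python
def fast_path_kind(obj):
    """Toy mirror of the E short-circuit: exact-type identity checks
    before the dispatch-table dict.get."""
    t = type(obj)
    if t is str:
        return "str"
    if t is int:
        return "int"
    if t is type(None):
        return "none"
    if t is bool:
        return "bool"
    if t is float:
        return "float"
    return "generic"  # self.dispatch.get(t) path in the real save()

class MyInt(int):
    pass

fast_path_kind(42)        # direct save path
fast_path_kind(True)      # type(True) is bool, not int, so no ambiguity
fast_path_kind(MyInt(1))  # subclass: generic path, semantics unchanged
```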
Pure-Python dump numbers (best-of-9 median per bench, vs `main` with
no pickle patches):
list_of_ints_10k dump -11.7% load -3.9%
list_of_strs_1k dump -8.9% load -3.4%
dict_str_int_5k dump -9.9% load -4.4%
deep_list dump -24.6% load -2.9%
nested_list_of_dicts dump -28.5% load -2.8%
(Cumulative with the exact-container fast paths from 94b53eb.) The
load deltas are unrelated noise; the load path was not touched.
Correctness:
test_pickle 1060/1060 pass
test_pickletools 202/202 pass
test_copy 83/83 pass
test_copyreg 6/6 pass
test_importlib 1217/1217 pass
dill 0.4.1 29/30 pass (pre-existing 3.15a8 incompat)
cloudpickle 3.1.2 236/236 + 22 skipped + 2 xfailed (identical)
Investigations that did NOT ship:
B) Hoisting persistent_id / reducer_override hook checks to __init__
— precomputed bool + __dict__ probe is strictly more work than the
original self.persistent_id(obj) call, which hits the type-attribute
cache. Rejected twice on clean measurements.
C) Atomic-tuple memoize skip — semantically safe but changes byte-
exact pickle output, breaking test_pickle_to_2x's fixture assertion.
Deferred; would require updating DATA_SET2 / DATA_XRANGE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the pure-Python _Pickler fast-path investigation that produced commits 94b53eb and bb9d721. Mirrors the structure of Misc/marshal-perf-diary.md.
Covers five experiments (A-E) — what shipped (Exp 4 exact-container fast paths; D inlined commit_frame; E atomic-type is-short-circuit), what was rejected (B hook hoisting — Python's type-attribute cache beats manual short-circuits), and what was deferred (C atomic-tuple memoize skip — blocked by byte-exact pickle-output test fixture).
Includes the kernel-compile-contamination lesson and the methodology that caught it on the clean rerun.
Adds Misc/pickle-perf-data/ with the six canonical JSON artifacts plus the bench / profile scripts used throughout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(F1)
Two changes:
1) _batch_setitems_exact large-dict branch (`n > _BATCHSIZE`) used to
materialise `list(obj.items())` upfront, silently losing the
`RuntimeError("dictionary changed size during iteration")` raised
by dict_items when mutation happens during save() (e.g. from a
persistent_id hook). This is the exact-dict semantics regression
flagged in review. Fix: delegate to the generic `_batch_setitems`,
which iterates through `batched()` over the live items view and
preserves the detection. The small-dict `n <= _BATCHSIZE` fast path
already used direct `for k, v in items:` iteration which triggers
the same check.
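The difference between the live view and the materialised list is easy to reproduce standalone (a sketch of the local reproducer; helper name is made up):

```python
def mutating_consumer(pairs, target):
    """Iterate key/value pairs, inserting a key into `target` on the
    first one (models a hostile persistent_id hook firing mid-save)."""
    first = True
    for k, v in pairs:
        if first:
            target["__new_key__"] = None
            first = False

d1 = {i: i for i in range(5)}
try:
    mutating_consumer(d1.items(), d1)    # live dict_items view
    live_raised = False
except RuntimeError:
    # "dictionary changed size during iteration" — the check we preserve
    live_raised = True

d2 = {i: i for i in range(5)}
mutating_consumer(list(d2.items()), d2)  # snapshot: mutation goes undetected
```

The live view raises; the upfront `list(obj.items())` silently pickles the stale snapshot, which is exactly the regression the generic `_batch_setitems` delegation avoids.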
2) Reorder save()'s fast paths to match the dispatch order in
Modules/_pickle.c::save(): atomic types (str / int / None / bool /
float) are tested *before* the memo.get()/reducer_override probe.
The C reference skips both for these types — str because save_str
handles memoisation inline, the others because they're never
memoized. Pure Python now does the same:
- str: inline memo.get check, then save_str on miss (preserves dedup)
- int / None / bool / float: direct save_*, no memo.get, no
reducer_override probe
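The str dedup guarantee can be modelled in a few lines (a toy memo keyed by id(), not the real _Pickler; opcode names replaced by tags):

```python
def save_with_memo(obj, memo, out):
    """Toy model of the inlined str memo check: first sighting saves
    the payload and records a memo index; repeats emit only a GET."""
    key = id(obj)
    hit = memo.get(key)
    if hit is not None:
        out.append(("GET", hit))     # memo hit: reference, no re-save
    else:
        memo[key] = len(memo)        # record index, then save payload
        out.append(("STR", obj))

memo, out = {}, []
s = "hello"
for _ in range(3):
    save_with_memo(s, memo, out)
# out == [("STR", "hello"), ("GET", 0), ("GET", 0)]
```

Inlining the `memo.get` before `save_str` keeps repeated-string deduplication byte-identical while still skipping the generic dispatch machinery.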
Pure-Python dump numbers vs clean `main` (best-of-9 median):
list_of_ints_10k dump -38.6% load -3.8%
list_of_strs_1k dump -17.2% load -3.6%
dict_str_int_5k dump -25.9% load -4.2%
deep_list dump -44.4% load -3.1%
nested_list_of_dicts dump -34.9% load -2.8%
Load deltas are unrelated noise. The semantic change (skip
reducer_override for atomic types) matches the C pickler's long-
established behaviour.
test_pickle 1060/1060, test_pickletools 202/202, test_copy 83/83,
cloudpickle 236/236 identical breakdown. Large-dict mutation detection
verified via a local reproducer (RuntimeError raised as expected).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
save_long for small non-negative ints did `BININT1 + pack("<B", obj)`
per call. Precompute the 256 two-byte opcode sequences once at module
import (about 50 KB) and use a tuple index instead. Eliminates a
_struct.pack call on every such save.
Pure-Python dump numbers vs the previous commit (F1):
list_of_ints_10k     dump -0.0%  (values 0-9999, only 256 hit the cache)
list_of_strs_1k      dump -1.4%
dict_str_int_5k      dump -1.4%
deep_list            dump -9.9%  ([i]*10 for i in 0..499 — every value hits the cache)
nested_list_of_dicts dump -1.5%
Cumulative vs clean `main`: -18% to -50% dump depending on workload.
test_pickle 1060/1060, cloudpickle 236/236.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
memoize() used to call self.put(idx) which returns an opcode bytestring
and then write it. For protocol >= 4 (the common case for any recent
user), put() just returns the single-byte MEMOIZE constant. Inline the
protocol dispatch directly into memoize() to skip the method call and
the temporary-bytes indirection.
Also caches `memo = self.memo` as a local to avoid two attribute loads
on the common path.
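Before/after shape of the change, as a hedged standalone sketch (class and put() are simplified stand-ins; the real memoize also asserts the object isn't already memoized):

```python
import pickle

class Sketch:
    def __init__(self, proto):
        self.proto = proto
        self.memo = {}
        self.out = bytearray()

    def memoize(self, obj):
        memo = self.memo                 # one attribute load, reused
        if self.proto >= 4:
            self.out += pickle.MEMOIZE   # inline: no put() call, no temp
        else:
            self.out += self.put(len(memo))
        memo[id(obj)] = len(memo)

    def put(self, idx):
        # kept for subclass override compatibility (simplified: the real
        # put() chooses between BINPUT / LONG_BINPUT / PUT)
        return pickle.LONG_BINPUT + idx.to_bytes(4, "little")
```

For protocol >= 4 the method call and the temporary bytes object both disappear; older protocols still route through put(), so subclasses overriding it are unaffected.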
Pure-Python dump numbers vs the previous commit (F4):
list_of_ints_10k dump +1.0% (ints not memoized; noise)
list_of_strs_1k dump -1.9%
dict_str_int_5k dump -1.2%
deep_list dump -1.3%
nested_list_of_dicts dump -1.6%
Cumulative vs clean `main`: -20% to -51% dump.
put() method kept for subclass override compatibility.
test_pickle 1060/1060.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `bytes` to the save() dispatch fast path, placed AFTER the four
non-memoized atomics (int / None / bool / float) so int-heavy workloads
don't pay an extra type-identity branch before hitting their direct
dispatch. bytes is memoized, so memo.get is inlined alongside the
save_bytes call the same way str is.
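The branch ordering and the bytearray fall-through can be modelled with a toy dispatcher (return values are tags, not real save_* calls):

```python
def fast_kind(obj):
    """Toy mirror of the F6 ordering: four non-memoized atomics first,
    then the memoized str/bytes checks, else the dispatch table."""
    t = type(obj)
    if t is int or t is type(None) or t is bool or t is float:
        return "atomic"          # direct save_*, no memo probe
    if t is str or t is bytes:
        return "memoized-fast"   # inline memo.get, then save_str/save_bytes
    return "generic"             # dispatch.get(t); bytearray lands here

fast_kind(b"abc")           # hits the bytes fast path
fast_kind(bytearray(b"a"))  # not type-identical to bytes: generic path
```

Int-heavy workloads exit on the first branch and never see the bytes test, which is why the ordering keeps the main benches flat.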
Dump deltas on a bytes-heavy workload:
list_of_short_bytes_5k dump -10.7%
list_of_medium_bytes_1k dump -11.7%
dict_bytes_to_int_2k dump -9.1%
list_of_bytearrays_1k   dump +0.4%  (bytearray isn't caught; hits dispatch table)
Main-bench impact (F2 -> F6v2): within ±0.7% across all benches.
bytearray specialisation is a separate question; rare in real payloads.
test_pickle 1060/1060.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expands Misc/pickle-perf-diary.md to cover the round-2 experiments:
F1 SHIPPED — atomic fast paths before memo (matches C pickler order)
F2 SHIPPED — inline MEMOIZE proto-4+ in memoize()
F3 REJECTED — frame byte counter (Python bookkeeping loses to
BytesIO.tell()'s C method call)
F4 SHIPPED — cache BININT1 opcode bytes for n in 0..255
F5 REJECTED — ASCII save_str (Python utf-8 encoder already fast
for pure-ASCII)
F6 SHIPPED — bytes in save() fast path
F7 DEFERRED — exact-set batching (needs set workload)
Plus the large-dict mutation-detection regression in Exp 4 that
review flagged, fixed as part of F1.
Updated "What we learned" with four new lessons distilled from the
failures (F3, F5) and the order-matching insight from F1. Added a
"What we validated this session" section citing full CPython
regression suite passing (48,928 tests), dill 29/30 (pre-existing
3.15a8 incompat), cloudpickle 243/243 identical to baseline, joblib
focused subset identical to baseline.
Misc/pickle-perf-data/ adds nine round-2 JSON artifacts plus the
bytes-specific bench script, and README.md documents which file
represents which branch state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Announces the pure-Python _Pickler speedup stack (Exp 4 + D + E + F1 + F2 + F4 + F6, seven commits preceding this one). Highlights the dill impact and flags the single semantic change (atomic types skip reducer_override to match the C reference implementation).
Contributor
I think the best way to share the data and experimental results would be to put them in a dedicated repository and link to it here. Additionally, I'd recommend splitting this PR into smaller, more focused PRs — one optimization per PR. Before proceeding, it would be a good idea to get feedback from @serhiy-storchaka.
Contributor
don't add json again and again
As discussed in #148706, packages like dill can be made ~20-40% faster by optimizing various paths in the pure-Python pickler that they subclass.
Note: I included a number of experiment logs and microbenchmark data for reviewers to examine; these should obviously be removed before merging, but I didn't see any documentation on an alternative for sharing this kind of info.
A few key points, most of which are provided in more detail in the Misc/pickle-perf-diary.md doc and Misc/pickle-perf-data/ results:
- attrs, dill, cloudpickle, joblib, etc. (3.15 already has some regressions)
Some representative benchmarks:
Assisted-by: gpt-5-4-xhigh
Assisted-by: opus-4-7